
9 research outputs found

Full-System GPU Design Space Exploration

Author: Franke Bjoern
Kaszyk Kuba
Publication date: 12/08/2020

Simulation methodologies for mobile GPUs

Author: Kaszyk Kuba
Publication venue: The University of Edinburgh
Publication date: 17/03/2022

GPUs critically rely on a complex system software stack comprising kernel- and user-space drivers and JIT compilers. Yet, existing GPU simulators typically abstract away details of the software stack and GPU instruction set. Partly, this is because GPU vendors rarely release sufficient information about their latest GPU products. However, this is also due to the lack of an integrated CPU-GPU simulation framework, which is complete and powerful enough to drive the complex GPU software environment. This has led to a situation where research on GPU architectures and compilers is largely based on outdated or greatly simplified architectures and software stacks, undermining the validity of the generated results. Making the situation even more dire, existing GPU simulation efforts are concentrated around desktop GPUs, making infrastructure for modelling mobile GPUs virtually non-existent, despite their surging importance in the GPU market. Still, mobile GPU designers are faced with the challenge of evaluating design alternatives involving hundreds of architectural configuration options and micro-architectural improvements under tight time-to-market constraints, to which currently employed design flows involving detailed, but slow simulations are not well suited. In this thesis we develop a full-system simulation environment for a mobile platform, which enables users to run a complete and unmodified software stack for a state-of-the-art mobile Arm CPU and Mali Bifrost GPU powered device, achieving 100% architectural accuracy across all available toolchains. We demonstrate the capability of our GPU simulation framework through a number of case studies exploring modern, mobile GPU applications, and optimize them using functional simulation statistics, unavailable with other approaches or hardware. Furthermore, we develop a trace-based performance model, allowing architects to rapidly model GPU configurations in early design space exploration.

Edinburgh Research Archive

HETSIM: Simulating Large-Scale Heterogeneous Systems using a Trace-driven, Synchronization and Dependency-Aware Framework

Author: Cole Murray
Dreslinski Ronald
Kaszyk Kuba
O'Boyle Michael F P
Pal Subhankar
Publication date: 12/08/2020

Edinburgh Research Explorer

Performance Aware Convolutional Neural Network Channel Pruning for Embedded GPUs

Author: Cano Jose
Crowley Elliot
Franke Bjoern
Kaszyk Kuba
O'Boyle Michael
Radu Valentin
Storkey Amos
Turner Jack
Wen Yuan
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2019

Convolutional Neural Networks (CNNs) are becoming a common presence in many applications and services, due to their superior recognition accuracy. They are increasingly being used on mobile devices, often simply by porting large models designed for server space, although several model compression techniques have been considered. One model compression technique intended to reduce computations is channel pruning. Mobile and embedded systems now have GPUs which are ideal for the parallel computations of neural networks and for their lower energy cost per operation. Specialized libraries perform these neural network computations through highly optimized routines. As we find in our experiments, these libraries are optimized for the most common network shapes, making uninstructed channel pruning inefficient. We evaluate higher level libraries, which analyze the input characteristics of a convolutional layer, based on which they produce optimized OpenCL (Arm Compute Library and TVM) and CUDA (cuDNN) code. However, in reality, these characteristics and subsequent choices intended for optimization can have the opposite effect. We show that a reduction in the number of convolutional channels, pruning 12% of the initial size, is in some cases detrimental to performance, leading to a 2× slowdown. On the other hand, we also find examples where performance-aware pruning achieves the intended results, with performance speedups of 3× with cuDNN and above 10× with Arm Compute Library and TVM. Our findings expose the need for hardware-instructed neural network pruning.

arXiv.org e-Print Archive

Crossref

Edinburgh Research Explorer

Enlighten

HETSIM: Simulating Large-Scale Heterogeneous Systems using a Trace-driven, Synchronization and Dependency-Aware Framework

Author: Cole Murray
Dreslinski Ronald G.
Feng Siying
Franke Bjoern
Kaszyk Kuba
Mudge Trevor
O'Boyle Michael F P
Pal Subhankar
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 19/11/2020

Crossref

Edinburgh Research Explorer

CoSPARSE: A Software and Hardware Reconfigurable SpMV Framework for Graph Analytics

Author: Chakrabarti Chaitali
Cole Murray
Dreslinski Ronald
Feng Siying
He Xin
Kaszyk Kuba
Morton Magnus
Mudge Trevor
O'Boyle Michael
Pal Subhankar
Park Dong-hyeon
Sun Jiawen
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 13/11/2021

Edinburgh Research Explorer

Prodigy: Improving the Memory Latency of Data-Indirect Irregular Workloads Using Hardware-Software Co-Design

Author: Ahmadi Agreen
Austin Todd
Behroozi Armand
Dreslinski Ronald
Kaszyk Kuba
Li Lu
Mahlke Scott
May Kyle
Morton John Magnus
Mudge Trevor
Nguyen Brandon
O'Boyle Michael F P
Sun Jiawen
Talati Nishil
Vasiladiotis Christos
Verma Tarunesh
Yang Yichen
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 22/04/2021

Edinburgh Research Explorer

Transmuter: Bridging the Efficiency Gap using Memory and Dataflow Reconfiguration

Author: Amarnath Aporva
Beaumont Jonathan
Blaauw David
Chakrabarti Chaitali
Cole Murray
Dreslinski Ronald
Feng Siying
He Xin
Kaszyk Kuba
Kim Hun-Seok
Kim Sung
May Kyle
Morton John Magnus
Mudge Trevor
O'Boyle Michael
Pal Subhankar
Park Dong-hyeon
Sun Jiawen
Xiong Yan
Yang Chi-Sheng
Publication venue: Association for Computing Machinery (ACM)
Publication date: 30/09/2020

Crossref

Edinburgh Research Explorer

Navigating the Landscape for Real-time Localisation and Mapping for Robotics, Virtual and Augmented Reality

Author: Bodin Bruno
Clarkson James
Davison Andrew J
Debrunner Thomas
Franke Bjoern
Furber Steve
Gonzalez-de-Aledo Pablo
Gorgovan Cosmin
Kaszyk Kuba
Kelly Paul H. J.
Kotselidis Christos
Luján Mikel
Mawer John
Melot Nicolas
Nardi Luigi
Nisbet Andy
O'Boyle Michael
Palomar Oscar
Riley Graham
Rodchenko Andrey
Saeedi Sajad
Spink Tom
Tomusk Erik-Arne
Vespa Emanuele
Wagstaff Harry
Webb Andrew
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 29/06/2018

Visual understanding of 3D environments in real-time, at low power, is a huge computational challenge. Often referred to as SLAM (Simultaneous Localisation and Mapping), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, virtual and augmented reality. This paper describes the results of a major research effort to assemble the algorithms, architectures, tools, and systems software needed to enable delivery of SLAM, by supporting applications specialists in selecting and configuring the appropriate algorithm and the appropriate hardware, and compilation pathway, to meet their performance, accuracy, and energy consumption goals. The major contributions we present are (1) tools and methodology for systematic quantitative evaluation of SLAM algorithms, (2) automated, machine-learning-guided exploration of the algorithmic and implementation design space with respect to multiple objectives, (3) end-to-end simulation tools to enable optimisation of heterogeneous, accelerated architectures for the specific algorithmic requirements of the various SLAM algorithmic approaches, and (4) tools for delivering, where appropriate, accelerated, adaptive SLAM solutions in a managed, JIT-compiled, adaptive runtime context.

Comment: Proceedings of the IEEE 201

arXiv.org e-Print Archive

Edinburgh Research Explorer

Spiral - Imperial College Digital Repository

The University of Manchester - Institutional Repository

University of St. Andrews - Pure

St Andrews Research Repository